
    DiviML: A Module-based Heuristic for Mapping Neural Networks onto Heterogeneous Platforms

    Datacenters are increasingly becoming heterogeneous and are starting to include specialized hardware for networking, video processing, and especially deep learning. To leverage the heterogeneous compute capability of modern datacenters, we develop an approach for compiler-level partitioning of deep neural networks (DNNs) onto multiple interconnected hardware devices. We present a general framework for heterogeneous DNN compilation, offering automatic partitioning and device mapping. Our scheduler integrates both an exact solver, through a mixed integer linear programming (MILP) formulation, and a modularity-based heuristic for scalability. Furthermore, we propose a theoretical lower bound formula for the optimal solution, which enables the assessment of the heuristic solutions' quality. We evaluate our scheduler in optimizing both conventional DNNs and randomly-wired neural networks, subject to latency and throughput constraints, on a heterogeneous system composed of a CPU and two distinct GPUs. Compared to naïvely running DNNs on the fastest GPU, the proposed framework can achieve more than 3× lower latency and up to 2.9× higher throughput by automatically leveraging both data and model parallelism to deploy DNNs on our sample heterogeneous server node. Moreover, our modularity-based "splitting" heuristic improves the solution runtime by up to 395× without noticeably sacrificing solution quality compared to an exact MILP solution, and outperforms all other heuristics by 30-60% in solution quality. Finally, our case study shows how we can extend our framework to schedule large language models across multiple heterogeneous servers by exploiting symmetry in the hardware setup. Our code can easily be plugged into existing frameworks and is available at https://github.com/abdelfattah-lab/diviml.
    Comment: accepted at ICCAD'2
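    To make the scheduler's exact solver concrete, below is a minimal sketch, in PuLP, of the kind of MILP device-mapping problem the abstract describes. It is not the authors' formulation: the module names and latencies are invented, and inter-device transfer costs (which the full framework models) are omitted for brevity.

```python
# Hypothetical MILP sketch of DNN module-to-device mapping (not the DiviML code).
import pulp

# Assumed per-module compute latency (ms) on each device.
latency = {
    "conv_block": {"cpu": 40.0, "gpu0": 6.0, "gpu1": 9.0},
    "attention":  {"cpu": 55.0, "gpu0": 8.0, "gpu1": 7.0},
    "classifier": {"cpu": 10.0, "gpu0": 2.0, "gpu1": 3.0},
}
modules, devices = list(latency), ["cpu", "gpu0", "gpu1"]

prob = pulp.LpProblem("dnn_device_mapping", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", (modules, devices), cat="Binary")  # x[m][d]=1: m runs on d
T = pulp.LpVariable("makespan", lowBound=0)  # load of the busiest device

prob += T  # objective: minimize the makespan
for m in modules:  # every module is mapped to exactly one device
    prob += pulp.lpSum(x[m][d] for d in devices) == 1
for d in devices:  # each device's total load bounds the makespan
    prob += pulp.lpSum(latency[m][d] * x[m][d] for m in modules) <= T

prob.solve(pulp.PULP_CBC_CMD(msg=False))
for m in modules:
    print(m, "->", next(d for d in devices if pulp.value(x[m][d]) > 0.5))
print("makespan (ms):", pulp.value(T))
```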

    BRAMAC: Compute-in-BRAM Architectures for Multiply-Accumulate on FPGAs

    Deep neural network (DNN) inference using reduced integer precision has been shown to achieve significant improvements in memory utilization and compute throughput with little or no accuracy loss compared to full-precision floating-point. Modern FPGA-based DNN inference relies heavily on the on-chip block RAM (BRAM) for model storage and the digital signal processing (DSP) unit for implementing the multiply-accumulate (MAC) operation, a fundamental DNN primitive. In this paper, we enhance the existing BRAM to also compute MAC by proposing BRAMAC (Compute-in-BRAM Architectures for Multiply-Accumulate). BRAMAC supports 2's complement 2- to 8-bit MAC in a small dummy BRAM array using a hybrid bit-serial & bit-parallel data flow. Unlike previous compute-in-BRAM architectures, BRAMAC allows read/write access to the main BRAM array while computing in the dummy BRAM array, enabling both persistent and tiling-based DNN inference. We explore two BRAMAC variants: BRAMAC-2SA (with 2 synchronous dummy arrays) and BRAMAC-1DA (with 1 double-pumped dummy array). BRAMAC-2SA/BRAMAC-1DA can boost the peak MAC throughput of a large Arria-10 FPGA by 2.6×/2.1×, 2.3×/2.0×, and 1.9×/1.7× for 2-bit, 4-bit, and 8-bit precisions, respectively, at the cost of a 6.8%/3.4% increase in the FPGA core area. By adding BRAMAC-2SA/BRAMAC-1DA to a state-of-the-art tiling-based DNN accelerator, an average speedup of 2.05×/1.7× and 1.33×/1.52× can be achieved for AlexNet and ResNet-34, respectively, across different model precisions.
    Comment: 11 pages, 13 figures, 3 tables, FCCM conference 202
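    As a functional illustration of the hybrid bit-serial & bit-parallel dataflow, here is a small Python model: activation bits stream in serially while, each "cycle", all lanes are reduced bit-parallel and shift-accumulated, with the MSB column negated for 2's complement. This is a sketch of the arithmetic only, not the BRAM circuitry; the function name and lane values are illustrative assumptions.

```python
def bit_serial_dot(acts, weights, n_bits=8):
    """Functional model of a bit-serial x bit-parallel signed MAC."""
    acc = 0
    for i in range(n_bits):
        # Bit-parallel across lanes: add the weights whose activation has
        # bit i set (Python ints are already 2's complement).
        partial = sum(w for a, w in zip(acts, weights) if (a >> i) & 1)
        # Shift-and-accumulate; the MSB column carries negative weight.
        acc += (-partial if i == n_bits - 1 else partial) << i
    return acc

assert bit_serial_dot([-3, 7], [5, -2]) == -3 * 5 + 7 * (-2)
```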

    Effect of sex on meat quality characteristics

    The goal of the current work was to determine the quality of male and female cattle and buffalo meat, as well as to conduct a trial for improving meat quality by feeding the experimental groups of cattle and buffalo for four months on rations containing 16.5% protein. During the year 2021, eighty samples of cattle and buffalo were obtained from butcher shops in Luxor, Egypt. The samples were divided into four categories: male cattle, female cattle, male buffalo, and female buffalo; each class was represented by 20 samples. Trials for improvement of the nutritional content of meat were carried out by feeding a ration containing 16.5% protein to male cattle and buffalo, with each class represented by 10 animals. The samples were analyzed to determine moisture, protein, fat, ash, carbohydrate, energy percentage, cooking loss, water-holding capacity, and tenderness, and cholesterol was determined in perinephric fat. Male beef was characterized by a greater protein content (18.02% ± 0.35%). On the other hand, male buffalo was characterized by low fat content and cholesterol levels of 1.60% ± 0.85% and 294.30 ± 2.40 mg/100 g, respectively. The experimental male cattle showed the highest protein percentage (18.50% ± 0.37%) and the lowest cholesterol level (267.19 ± 6.25 mg/100 g). The use of the experimental ration could improve the quality of male beef in terms of protein value as well as cholesterol level.

    The power of communication: Energy-efficient NoCs for FPGAs

    Integrating networks-on-chip (NoCs) on FPGAs can improve device scalability and facilitate design by abstracting communication and simplifying timing closure, not only between modules in the FPGA fabric but also with large "hard" blocks such as high-speed I/O interfaces. We propose mixed and hard NoCs that add less than 1% area to large FPGAs and run 5-6× faster than the soft NoC equivalent. A detailed power analysis, per NoC component, shows that routers consume 14× less power when implemented hard compared to soft, and that, whether hard or soft, most of the router's power is consumed in the input modules for buffering. For complete systems, hard NoCs consume less than 6% (and as low as 3%) of the FPGA's dynamic power budget to support 100 GB/s of communication bandwidth. We find that, depending on design choices, hard NoCs consume 4.5-10.4 mJ of energy per GB of data transferred. Surprisingly, this is comparable to the energy efficiency of the simplest traditional interconnect on an FPGA: soft point-to-point links require 4.7 mJ/GB. In many designs, communication must include multiplexing, arbitration and/or pipelining. For all these cases, our results indicate that a hard NoC will be more energy efficient than the conventional FPGA fabric.
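    A quick back-of-envelope check (not a script from the paper) shows how the quoted mJ/GB figures translate into sustained power at the quoted 100 GB/s, consistent with the abstract's dynamic-power claim:

```python
# mJ/GB x GB/s -> W; the figures come from the abstract, the script is illustrative.
bandwidth = 100.0  # GB/s of communication bandwidth
for name, mj_per_gb in [("hard NoC, best", 4.5),
                        ("hard NoC, worst", 10.4),
                        ("soft point-to-point", 4.7)]:
    print(f"{name}: {mj_per_gb * 1e-3 * bandwidth:.2f} W at {bandwidth:.0f} GB/s")
# ~0.45-1.04 W for the hard NoC: a small slice of a large FPGA's dynamic power.
```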

    Design tradeoffs for hard and soft FPGA-based Networks-on-Chip

    Embedding networks-on-chip (NoCs) in FPGAs has the potential not only to improve the efficiency of the interconnect, but also to increase designer productivity and reduce compile time by raising the abstraction level of communication. By comparing NoC components on FPGAs and ASICs, we quantify the efficiency gap between the two platforms and use the results to understand the design tradeoffs in that space. The crossbar has the largest FPGA vs. ASIC gaps: 85× area and 4.4× delay, while the input buffers have the smallest: 17× area and 2.9× delay. For a soft NoC router, these results indicate that wide datapaths, deep buffers, and a small number of ports and virtual channels (VCs) are favorable for FPGA implementation. If one hardens a complete state-of-the-art VC router, it is on average 30× more area efficient and can achieve 3.6× the maximum frequency of a soft implementation. We show that this hard router can be integrated with the soft FPGA interconnect and still achieve an area improvement of 22×. A 64-node NoC of hard routers with soft interconnect utilizes area equivalent to 1.6% of the logic modules in the latest FPGAs, compared to 33% for a soft NoC.
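    As a sanity check on these figures (illustrative arithmetic, not data from the paper), the per-router hardening gain and the full-NoC area numbers are mutually consistent:

```python
soft_noc_area = 33.0  # 64-node soft NoC, % of logic modules (from the abstract)
hard_gain = 22.0      # hard router with soft interconnect (from the abstract)
print(f"{soft_noc_area / hard_gain:.1f}%")  # ~1.5%, close to the quoted 1.6%
```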

    Are We There Yet? Product Quantization and its Hardware Acceleration

    Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs). Recently, product quantization (PQ) has been successfully applied to these workloads, replacing MACs with memory lookups of pre-computed dot products. While this property makes PQ an attractive solution for model acceleration, little is understood about the associated trade-offs in terms of compute and memory footprint, and the impact on accuracy. Our empirical study investigates the impact of different PQ settings and training methods on layerwise reconstruction error and end-to-end model accuracy. When studying the efficiency of deploying PQ DNNs, we find that metrics such as FLOPs, number of parameters, and even CPU/GPU performance can be misleading. To address this issue, and to more fairly assess PQ in terms of hardware efficiency, we design the first custom hardware accelerator to evaluate the speed and efficiency of running PQ models. We identify PQ configurations that improve performance-per-area for ResNet20 by 40%-104%, even when compared to a highly optimized conventional DNN accelerator. Our hardware outperforms recent PQ solutions by 4×, with only a 0.6% accuracy degradation. This work demonstrates the practical and hardware-aware design of PQ models, paving the way for wider adoption of this emerging DNN approximation methodology.
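    To show mechanically what "replacing MACs with memory lookups" means, here is a minimal NumPy sketch of PQ inference. The dimensions and the random (untrained) codebooks and codes are placeholders; a real deployment would learn both, as the paper's training study discusses.

```python
# Illustrative PQ matrix-vector product: one small GEMM plus lookups, no per-weight MACs.
import numpy as np

rng = np.random.default_rng(0)
d, n_sub, k, n_out = 64, 8, 16, 32   # input dim, subspaces, codewords, output rows
d_sub = d // n_sub

codebooks = rng.standard_normal((n_sub, k, d_sub))  # stand-in for trained codebooks
codes = rng.integers(0, k, size=(n_out, n_sub))     # stand-in for encoded weight rows

def pq_matvec(x):
    x_sub = x.reshape(n_sub, d_sub)
    # Build lookup tables: dot of each input subvector with every codeword.
    tables = np.einsum("skd,sd->sk", codebooks, x_sub)   # (n_sub, k)
    # Each output is just a sum of n_sub table lookups.
    return tables[np.arange(n_sub), codes].sum(axis=1)   # (n_out,)

print(pq_matvec(rng.standard_normal(d)).shape)  # (32,)
```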

    α-Globin Messenger Ribonucleic Acid as a Molecular Marker for Determining the Age of Human Blood Spots in Different Temperatures

    Background: Analyzing recovered evidence, such as blood, which is one of the most frequently encountered types of biological evidence, can provide information to establish the definite time when a crime was committed. This study aims to investigate the time- and temperature-related effects on human bloodstains' α-globin messenger RNA expression and to estimate a bloodstain's age using α-globin mRNA. Methods: A total of 22 blood samples were collected from healthy middle-aged volunteers (12 women and 10 men). After preparation, the samples were exposed to temperatures of 4°C, 24°C, and 40°C. Next, the mRNA expression of the α-globin gene was quantified by real-time RT-PCR at time intervals of 0, 30, 90, and 150 days. Results: The α-globin gene expression showed the highest mean values at day 0 and 4°C and the lowest mean values at 150 days and 40°C. Samples from male participants showed higher mean values of α-globin gene expression than their female counterparts. A significant negative correlation was detected between α-globin gene expression and time interval, and a regression equation was formulated to estimate the time interval from the α-globin gene concentration. Conclusion: α-Globin mRNA could be a useful marker to estimate the age of human blood spots.
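    The abstract reports, but does not give, the regression equation, so purely as an illustration of the fitting-and-inversion step it describes, here is a sketch with invented expression values:

```python
# Hypothetical numbers; only the method (regress age on expression) is from the abstract.
import numpy as np

days = np.array([0.0, 30.0, 90.0, 150.0])        # sampling intervals from the study
expression = np.array([1.00, 0.71, 0.38, 0.15])  # invented mean expression levels

slope, intercept = np.polyfit(expression, days, 1)  # age ~ expression

def estimate_age(level):
    return slope * level + intercept

print(round(estimate_age(0.5)), "days")  # estimated bloodstain age for a new sample
```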

    Zero-Cost Proxies Meet Differentiable Architecture Search

    Differentiable neural architecture search (NAS) has attracted significant attention in recent years due to its ability to quickly discover promising architectures of deep neural networks, even in very large search spaces. Despite its success, DARTS lacks robustness in certain cases; e.g., it may degenerate to trivial architectures with excessive parameter-free operations such as skip connections or random noise, leading to inferior performance. In particular, operation selection based on the magnitude of architectural parameters was recently proven to be fundamentally wrong, showcasing the need to rethink this aspect. On the other hand, zero-cost proxies have been recently studied in the context of sample-based NAS, showing promising results: speeding up the search process drastically in some cases, but also failing on some of the large search spaces typical for differentiable NAS. In this work, we propose a novel operation selection paradigm in the context of differentiable NAS which utilises zero-cost proxies. Our perturbation-based zero-cost operation selection (Zero-Cost-PT) improves searching time and, in many cases, accuracy compared to the best available differentiable architecture search, regardless of the search space size. Specifically, we are able to find comparable architectures to DARTS-PT on the DARTS CNN search space while being over 40× faster (total searching time of 25 minutes on a single GPU).
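    A condensed sketch of the perturbation-style idea: score each candidate operation on an edge with a cheap, training-free proxy and keep the best. The gradient-norm proxy, toy candidate ops, and tensor shapes below are illustrative assumptions, not the paper's exact proxy or search space.

```python
import torch
import torch.nn as nn

# Toy candidate operations competing on one supernet edge.
candidates = {
    "conv3x3": nn.Conv2d(8, 8, 3, padding=1),
    "conv1x1": nn.Conv2d(8, 8, 1),
    "skip":    nn.Identity(),
}

def grad_norm_proxy(op):
    """Zero-cost score: gradient norm from one random minibatch, no training."""
    x = torch.randn(4, 8, 16, 16, requires_grad=True)
    op(x).pow(2).mean().backward()  # surrogate loss on random data
    grads = [p.grad.norm() for p in op.parameters() if p.grad is not None]
    # Parameter-free ops (e.g. skip) fall back to the input-gradient norm.
    return torch.stack(grads).sum().item() if grads else x.grad.norm().item()

scores = {name: grad_norm_proxy(op) for name, op in candidates.items()}
print(scores, "-> select:", max(scores, key=scores.get))
```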